Automate your data science project structure in three easy steps

The article was originally published here

_Free Vector illustrations from Scale_

Good Code is its own best documentation

Dr. Rachael Tatman, in one of her presentations, highlighted the importance of code reproducibility in a very subtle way:

“Why should you care about reproducibility? Because the person most likely to need to reproduce your work… is you.”

This is true on so many levels. Have you ever found it difficult to decipher your own codebase? Do you often end up with multiple files like untitled1.py or untitled2.ipynb? Most of us have faced the brunt of bad coding practices on at least a few occasions. The situation is even more common in data science. Often, we limit our focus to the analysis and the end product while ignoring the quality of the code responsible for the analysis.

Why is reproducibility a vital ingredient in the data science pipeline? I have touched upon this topic in another blog post, and I’ll borrow a few lines from there. A reproducible example allows someone else to recreate your analysis using the same data. This matters because you put your work out in public for others to use, and that purpose is defeated if they cannot reproduce it. In this article, let’s look at three useful tools that can help you create structured and reproducible projects.

Creating a good project structure

Let’s say you want to create a project that contains code to analyze the sentiment of movie reviews. There are three essential steps to creating a good project structure:

_The pipeline of creating a project template | Image by Author_

1. Automating project template creation with Cookiecutter Data Science

![Icon by @NounProject | CC: Creative Commons](https://cdn-images-1.medium.com/max/2000/1*Gw61dGYG48pd5VO3srkgGg.png)

There is no clear consensus in the community on best practices for organizing machine learning projects. As a result, there is a plethora of choices, and this lack of clarity leads to confusion. Fortunately, there is a workaround, thanks to the people at DrivenData. They have created a tool called Cookiecutter Data Science, which provides a standardized but flexible project structure for doing and sharing data science work. A few lines of code set up a whole series of subdirectories and make it easier to start, structure, and share an analysis. You can read more about the tool on their project home page. Let’s get to the interesting part and see it in action.

Installation

pip install cookiecutter

or

conda config --add channels conda-forge
conda install cookiecutter

Starting a new project

Head over to your terminal and run the following command. It will automatically populate a directory with the required files.

cookiecutter https://github.com/drivendata/cookiecutter-data-science

_Using Cookiecutter Data Science | Image by Author_

A sentiment analysis directory gets created at the specified path, which in the above case is the Desktop.

The directory structure of the newly created project | Image by Author
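To get a feel for the kind of work the template automates, here is a minimal sketch in plain Python that creates a small skeleton of subdirectories. The folder names below are assumptions, loosely modeled on a typical data science layout; they are not the exact tree the Cookiecutter Data Science template produces, which contains more files and may differ between versions.

```python
from pathlib import Path

# Hypothetical layout for illustration only; the real template
# creates a richer tree and may differ between versions.
SKELETON = [
    "data/raw",
    "data/processed",
    "notebooks",
    "models",
    "reports/figures",
    "src",
]


def create_skeleton(root, subdirs=SKELETON):
    """Create the project root and each subdirectory; return the created paths."""
    root = Path(root)
    created = []
    for sub in subdirs:
        path = root / sub
        # parents=True also creates intermediate folders such as "data".
        path.mkdir(parents=True, exist_ok=True)
        # An empty .gitkeep lets Git track otherwise empty directories.
        (path / ".gitkeep").touch()
        created.append(path)
    return created
```

Calling `create_skeleton("sentiment_analysis")` would create the folders under the current working directory. The template does considerably more than this, which is the argument for using it rather than maintaining a script of your own.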

Note: Cookiecutter Data Science will be moving to version 2 soon, so there will be a slight change in how the command is used. You will have to use ccds … rather than cookiecutter … in the command above. As per the GitHub repository, this version of the template will still be available, but you will have to explicitly use -c v1 to select it. Keep an eye on the documentation for when the change happens.

2. Creating a good README with readme.so

![Icon by @NounProject | CC: Creative Commons](https://cdn-images-1.medium.com/max/2000/1*pFC6ZB0FMVita5OHWVpF_g.png)

After creating the skeleton of the project, you need to populate it. But before that, there is an important file to update: the README. A README is a markdown file that communicates essential information about your project. It tells others what the project is about, which license it uses, how they can contribute, and so on. I have seen many people put tremendous effort into their projects but fail to create decent READMEs. If you are one of them, there is some good news in the form of a project called [readme.so](https://readme.so/).

A good soul has just put an end to writing READMEs manually. Katherine Peterson recently created a simple editor allowing you to create and customize your project’s readme quickly.

The editor is pretty intuitive. You only need to click on a section to edit the content, and the section gets added to your readme. Choose the ones you like from an extensive collection. You can also reorder the sections to place them where you want on the page. Once you have everything in place, go ahead and copy the content or download the file and add it to your existing project.

![Generate automatic READMEs with readme.so | Image by Author](https://cdn-images-1.medium.com/max/3126/1*Le7xvk0HTxsGR-_xb6eNvA.gif)
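If you would rather script the same idea than use a web editor, the sketch below assembles a README from named sections. This is not how readme.so works internally; it is only an illustration of the section-by-section approach. The function name, section headings, and contents are all hypothetical.

```python
def build_readme(title, sections):
    """Render a Markdown README from a title and an ordered mapping of
    section heading -> body text. Sections with empty bodies are skipped."""
    lines = [f"# {title}", ""]
    for heading, body in sections.items():
        if not body.strip():
            continue
        lines += [f"## {heading}", "", body.strip(), ""]
    return "\n".join(lines)


# Hypothetical section names and contents, for illustration only.
sections = {
    "Description": "Sentiment analysis of movie reviews.",
    "Installation": "pip install -r requirements.txt",
    "License": "",  # left empty, so it is omitted from the output
}
readme_text = build_readme("Sentiment Analysis", sections)
```

Writing `readme_text` to a `README.md` file at the project root would complete the step. Keeping the sections in a dictionary makes it easy to reorder or drop them, much like dragging sections around in the editor.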

3. Push your code to GitHub

_Icon by @NounProject | CC: Creative Commons_

We are almost done. The only thing left is to push the code to GitHub (or any version control platform of your choice). You can do that easily via Git. Here is a handy cheat sheet containing the most important and commonly used Git commands for easy reference.
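For readers who want to script that first push, here is a minimal sketch that wraps the usual Git commands in Python. It assumes Git is installed, that an empty repository already exists on GitHub, and that your credentials are configured. The function names, the default branch name, and the commit message are illustrative choices, not part of any tool mentioned above.

```python
import subprocess


def build_push_commands(remote_url, branch="main", message="Initial commit"):
    """Return the Git commands, as argument lists, for a first push."""
    return [
        ["git", "init"],
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "branch", "-M", branch],
        ["git", "remote", "add", "origin", remote_url],
        ["git", "push", "-u", "origin", branch],
    ]


def push_project(project_dir, remote_url, dry_run=True):
    """Print each command; run it inside project_dir only when dry_run is False."""
    commands = build_push_commands(remote_url)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, cwd=project_dir, check=True)
    return commands
```

With `dry_run=True` (the default here) the function only prints the commands, which is a safe way to review them before running anything against a real remote.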

Alternatively, if you use Visual Studio Code (VS Code), like me, it is already taken care of. VS Code makes it possible to publish any project directly to GitHub without having to create a repository first. VS Code will create the repository for you and let you choose whether it should be public or private. The only thing required from your side is to authenticate with GitHub through VS Code.

![Pushing Code to GitHub via Visual Studio Code | Image by Author](https://cdn-images-1.medium.com/max/3588/1*XCJ5_8uE1n8B8jf3iUWRAw.gif)

That is all you need to set up a robust and structured project base. All the above steps have been summarized in the following video in case you want to see them together from start to finish.

Conclusion

Creating structured and reproducible projects might seem difficult in the beginning, but it offers advantages in the long run. In this article, we looked at three useful tools that can help with this task. Cookiecutter Data Science gives you a clean project template, and readme.so helps you put together a README file quickly. Finally, VS Code can help you push the project to the web for source control and collaboration. Together, these create the necessary foundation for a good data science project. Now you can begin working on your data and derive insights from it to share with various stakeholders.